;;; -*- Mode: TEXT -*-
;;; File:  AutoClass:doc;Design.text
;;;--------------------------------------------------------------------------;;;
;;;  AUTOCLASS 3.0  Released 5/11/90    contact:  Taylor@pluto.arc.nasa.gov  ;;;
;;;  by P. Cheeseman, J. Stutz, R. Hanson, W. Taylor 			     ;;;
;;;  NASA Ames Research Center, MS 244-17, Moffett Field, CA 94035           ;;;
;;;                                                                          ;;;
;;;  Copyright (C) 1990 Research Institute for Advanced Computer Science.    ;;;
;;;  All rights reserved.  The RIACS Software Policy contains specific       ;;;
;;;  terms and conditions on the use of this software, and must be           ;;;
;;;  distributed with any copies.  THIS FILE MAY BE REDISTRIBUTED.  This     ;;;
;;;  copyright and notice must be preserved in all copies made of this file. ;;;
;;;--------------------------------------------------------------------------;;;
This file gives an intermediate level description of the AutoClass-3 program
design.  It was originally generated as an informal system specification and
development notebook.
Objectives:
1 - A modular implementation of the likelihood function that allows easy
extension of the set of attribute interactions that can be described.  This
contrasts with AutoClass 2 in which the single interaction type (conditional
independence) was built into the code, and the only alternative was to ignore
some attributes.
2 - User specification of the current likelihood function, or class model, at
run time.  This is envisioned as being a product of conditionally independent
likelihood terms over selected data attributes.  The user chooses the type of
interaction, for specific sets of attributes, from a range of prespecified
interaction types.
    Note that this probably pays too little attention to the problem of
searching the model space.  In the block covariant normal there are many
combinations of blocks possible.
3 - Minimize runtime by compiling the class likelihood function into a single
function over the class parameters and a datum's attributes. This has been
extended to a variety of class specific functions.
4 - For developmental flexibility it is desired to:
	a - Be able to maintain multiple classifications. 
	b - Be able to maintain multiple models.
5 - To support future research it is desired to:
	a - Define classifications that contain multiple class
likelihood models.
	b - Define hierarchical models capable of runtime optimization.
Constraints:
1 - The range of the calculated probabilities forces us to perform our
calculations in terms of logarithms.  Probability normalization may then result
in underflow, and produce nominal zeros.
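    As an illustration of this constraint, here is a minimal Python sketch (not
the AutoClass Lisp code) of normalizing one datum's class weights when they are
held as logarithms.  The function name normalize_log_weights and the sample
values are assumptions made for this example.  It uses the common trick of
subtracting the largest log weight before exponentiating; the source does not
say that AutoClass does exactly this.

```python
import math

def normalize_log_weights(log_weights):
    """Convert unnormalized log class weights for one datum to probabilities."""
    peak = max(log_weights)
    # Shifting by the largest log weight makes the largest term exp(0) = 1,
    # so the normalizing sum can never underflow to zero.
    shifted = [math.exp(lw - peak) for lw in log_weights]
    total = sum(shifted)
    return [s / total for s in shifted]

# exp(-1000) is far below the smallest positive double, so these weights
# could not be represented directly as probabilities.
log_weights = [-1000.0, -1001.0, -1800.0]
weights = normalize_log_weights(log_weights)
```

The third weight is so much smaller than the others that it still underflows
after the shift and comes out as exactly 0.0, which is the kind of nominal zero
mentioned above.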
Terminology:
    I will speak of probability and likelihood distribution functions
interchangeably.  They have exactly the same functional form, used either to
calculate the probability of data with respect to fixed parameters or the
likelihood of parameters with respect to a fixed set of data.
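    A small Python sketch of the point, using a normal density purely as an
example.  The names normal_log_density and log_prob_of_data and the sample
numbers are assumptions for illustration only.  The same function is read as a
probability when the parameters are held fixed, and as a likelihood when the
data are held fixed and the parameters vary.

```python
import math

def normal_log_density(x, mean, sigma):
    """Log of the normal density; one functional form for both readings."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)

def log_prob_of_data(data, mean, sigma):
    """Sum of per-datum log densities, assuming independent data."""
    return sum(normal_log_density(x, mean, sigma) for x in data)

data = [1.0, 2.0, 3.0]

# Probability reading: parameters fixed, evaluated for the given data.
log_p = log_prob_of_data(data, mean=2.0, sigma=1.0)

# Likelihood reading: data fixed, compared across candidate parameter values.
candidate_means = [1.0, 2.0, 3.0]
best_mean = max(candidate_means, key=lambda m: log_prob_of_data(data, m, 1.0))
```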
Implementation:
    A classification is defined in terms of the probability model(s) and
classes which instantiate the model(s). The classification is made with respect
to some particular database.  A classification is implemented as a
classification-$ structure (short name is clsf-$) which contains several
parameters, a database pointer, a vector of model pointers, and an adjustable
fill-pointer vector of class pointers. See the files ..>prog>struct-model.lisp,
..>prog>struct-clsf.lisp, and ..>prog>struct-class.lisp.  Models,
classifications, and classes are implemented in an object oriented manner using
Common Lisp structures with supplementary functions having the same prefix as
the corresponding structure accessors.
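    The containment relationships described above can be sketched as follows.
This is a Python illustration only, not the Common Lisp structure definitions
in the files named above.  Every field and function name here (Clsf,
ClassInstance, clsf_add_class, clsf_n_classes, and so on) is hypothetical,
chosen to mirror the prefix naming convention.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Database:
    n_data: int
    n_atts: int

@dataclass
class Model:
    database: Database
    att_sets: List[List[int]]      # attribute indices grouped into subsets

@dataclass
class ClassInstance:
    model: Model
    weights: List[float]           # one class weight per datum

@dataclass
class Clsf:
    database: Database             # the database the classification refers to
    models: List[Model]            # vector of model pointers
    classes: List[ClassInstance] = field(default_factory=list)  # grows as classes are added

def clsf_add_class(clsf: Clsf, model_index: int) -> ClassInstance:
    """Append a new class that instantiates one of the classification's models."""
    model = clsf.models[model_index]
    new_class = ClassInstance(model, [0.0] * clsf.database.n_data)
    clsf.classes.append(new_class)
    return new_class

def clsf_n_classes(clsf: Clsf) -> int:
    """Supplementary function sharing the clsf prefix."""
    return len(clsf.classes)

db = Database(n_data=4, n_atts=3)
model = Model(db, att_sets=[[0], [1, 2]])
clsf = Clsf(db, [model])
clsf_add_class(clsf, 0)
clsf_add_class(clsf, 0)
```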
    A probability model is defined in terms of attribute interactions for a
particular database.  The model partitions the attributes into subsets (type
att-set-$) whose members interact according to a particular probability
function term.  Within the model, the subsets are assumed to be conditionally
independent (given the class).  Thus the inter-set probabilities are
multiplicatively combined and the model specifies a probability function term
for each att-set.  The model also defines the parameter structure of instance
classes, provides the priors, and holds the names of the runtime compiled model
dependent functions (log-likelihood, update-xxxx, etc.) used for generic
operations on its classes.  A user specifies the model in an xxx.model file
that is interpreted by functions in the ..>prog>i/o-read-model file.  The
runtime definition of model/database specific functions is carried out by
invocation of the expand-model-terms function and the expand-xxx-fn functions
in file ..>prog>model-expander.
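    The idea of expanding a model into a single class log-likelihood function
can be sketched with closures.  This is a hedged Python illustration, not the
expand-model-terms code.  The names expand_model and TERM_BUILDERS, the
(term kind, attribute index) form of the model spec, and the parameter layouts
are all assumptions.  Because the attribute subsets are conditionally
independent given the class, their probabilities multiply, so their logarithms
add.

```python
import math

def single_multinomial(att):
    """Build a term for one discrete attribute; params is a list of value probabilities."""
    def term(params, datum):
        return math.log(params[datum[att]])
    return term

def single_normal(att):
    """Build a term for one real attribute; params is a (mean, sigma) pair."""
    def term(params, datum):
        mean, sigma = params
        x = datum[att]
        return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)
    return term

# Hypothetical registry of available term types.
TERM_BUILDERS = {
    "single_multinomial": single_multinomial,
    "single_normal": single_normal,
}

def expand_model(spec):
    """Turn a model spec into one function over class parameters and a datum."""
    terms = [TERM_BUILDERS[kind](att) for kind, att in spec]
    def log_likelihood(class_params, datum):
        # Independent subsets multiply, so their log terms are summed.
        return sum(term(p, datum) for term, p in zip(terms, class_params))
    return log_likelihood

spec = [("single_multinomial", 0), ("single_normal", 1)]
log_likelihood = expand_model(spec)
class_params = [[0.25, 0.75], (0.0, 1.0)]
datum = [1, 0.0]
value = log_likelihood(class_params, datum)
```

The attribute indices are bound once when the model is expanded, so the
per-datum call does no lookup of the model description.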
    The probability function terms specify how the element(s) of an attribute
subset interact to produce a probability.  Addition of a new probability term
type requires specification of priors, the parameter structures, and a set of
functions for the likelihood term, the class statistics based likelihood and
marginal term approximations, and certain auxiliary terms.  Runtime model
expansion will then produce a set of model specific class functions which call
the appropriate probability terms with precompiled arguments.  See the files
..>prog>model-xxx.lisp, particularly ..>prog>model-expander.  The currently
defined probability terms are single-multinomial, single-normal-cn (for
Constant observation error, No missing values), and single-normal-cm (for
Constant observation error, Missing values present).
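    To make the list of required pieces concrete, here is a Python sketch of
what a term type might bundle for a single discrete attribute.  It is not the
AutoClass single-multinomial implementation.  The dictionary layout, the
function names, and in particular the uniform pseudo-count standing in for a
prior are assumptions made for this example.

```python
import math

def multinomial_log_likelihood(probs, value):
    """Likelihood term: log probability of one observed discrete value."""
    return math.log(probs[value])

def multinomial_update(values, weights, n_values, prior_count=1.0):
    """Re-estimate value probabilities from class-weighted counts.

    prior_count is an assumed uniform pseudo-count playing the role of a prior.
    """
    counts = [prior_count] * n_values
    for v, w in zip(values, weights):
        counts[v] += w          # each datum contributes its class weight
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical bundle of the functions a term type supplies.
SINGLE_MULTINOMIAL = {
    "log_likelihood": multinomial_log_likelihood,
    "update": multinomial_update,
}

values = [0, 1, 1, 0]               # observed attribute values
weights = [1.0, 0.5, 0.5, 0.0]      # this class's weight for each datum
probs = SINGLE_MULTINOMIAL["update"](values, weights, n_values=2)
```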
    A class is an instance of a likelihood model within a
classification.  It consists of:
	a. Structures instantiating the parameter set (and auxiliary
variables) of the model.
	b. Functions to reference the model dependent functions.
	c. A vector of the class weights (probabilities) for each datum.
	d. Various class specific parameters.
See file ...>prog>struct-class.lisp.
    A database contains descriptive information and a vector of data instance
vectors of attribute values.  The information consists of the source file(s)
name(s), the number of instances (n-data), number of attributes (n-atts), and
an 'att-info vector.  There is a positional correspondence between the
'att-info vector and the data instance vectors. The elements of 'att-info give
the data type (one of *att-types*), a documentation string, and type dependent
range information (see real-range-$ and disc-range-$).  Two databases are
considered equivalent, with respect to a particular model, if all referenced
attributes have the same type and missing value, and discrete types have the
same range. The database input functions are in file i/o-read-data.lisp.
Databases are stored on a file pair: an xxx.db2 file, which contains the number
of data and the object vectors, and an xxx.hd2 file, which contains the object
descriptions.
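    The equivalence test between two databases, as stated above, can be
sketched as follows.  This is a Python illustration, not the code in
i/o-read-data.lisp.  The AttInfo fields, the type names "real" and "discrete",
and the function name equivalent_for_model are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class AttInfo:
    att_type: str                     # assumed type names: "real" or "discrete"
    doc: str                          # documentation string
    missing_value: object             # marker used for a missing value
    disc_range: Optional[int] = None  # number of values, discrete types only

def equivalent_for_model(info_a: List[AttInfo], info_b: List[AttInfo],
                         referenced: List[int]) -> bool:
    """Compare only the attributes that the model actually references."""
    for i in referenced:
        a, b = info_a[i], info_b[i]
        if a.att_type != b.att_type or a.missing_value != b.missing_value:
            return False
        if a.att_type == "discrete" and a.disc_range != b.disc_range:
            return False
    return True

db_a = [AttInfo("real", "height", None), AttInfo("discrete", "color", "?", 3)]
db_b = [AttInfo("real", "height", None), AttInfo("discrete", "color", "?", 4)]

same_on_height = equivalent_for_model(db_a, db_b, referenced=[0])
same_on_both = equivalent_for_model(db_a, db_b, referenced=[0, 1])
```

The two example databases differ only in the range of the second attribute, so
they are equivalent for a model that references the first attribute alone and
not for one that references both.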
Files:
All files of the development system are in Autoclass:....  The program files
are in Autoclass:program;, data files in Autoclass:data;, and experimental
results in Autoclass:results;.  The program is implemented as a system under the
name of autoclass with short name of ac in package ac. The system declaration
file is ...>prog>sys-dcl.lisp.  Generally those files forming a system module
will have names with a common prefix.
Operation:
See file ...>usage.text for current information on preparing for, running,
and interpreting the results of AutoClass.